ax.set_xlabel("Weight (in lbs)")plt.show()
🤔 What’s the name of this graph?
Histogram
A histogram is an accurate representation of the distribution of numerical data.
It is an estimate of the probability distribution of a continuous variable.
⚠️ Histogram (continuous univariate) ≠ Bar chart (categorical + numerical)
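A minimal sketch of how such a histogram could be drawn with seaborn (the weight data here is hypothetical, standing in for the dataset used in these notes):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data: 500 weights drawn from a normal distribution
df = pd.DataFrame({"weight": np.random.normal(loc=160, scale=25, size=500)})

ax = sns.histplot(df["weight"], bins=30)
ax.set_xlabel("Weight (in lbs)")
plt.show()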
Cumulative plots
Alternatively, we can plot the count of weights less than or equal to a certain value.
Instead of counts, we can also plot the density (which sums to 1) or the percentage (which sums to 100).
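A minimal sketch of such a cumulative plot (on hypothetical weight data, since the original code cell is not shown in full); seaborn's stat argument switches between counts, proportions, and percentages:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data standing in for the weights used above
df = pd.DataFrame({"weight": np.random.normal(loc=160, scale=25, size=500)})

# Cumulative count of weights below each value; stat="density" or stat="percent"
# would make the curve end at 1 or 100 instead
ax = sns.histplot(df["weight"], bins=30, cumulative=True, stat="count")
ax.set_xlabel("Weight (in lbs)")
plt.show()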
2️⃣ Summary statistics
🎯 Goal: communicate the largest amount of information about the distribution of data in the simplest possible way.
Typical summary statistics include measures of:
Location / central tendency (e.g. mean)
Statistical Dispersion / spread (e.g. variance)
Shape of the distribution (e.g. skewness & kurtosis)
Linear correlation of two variables X and Y
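As a quick sketch (on hypothetical data), pandas exposes most of these summary statistics directly:

import numpy as np
import pandas as pd

# Hypothetical sample and a second, correlated variable
x = pd.Series(np.random.normal(loc=100, scale=12, size=1_000))
y = 0.5 * x + np.random.normal(scale=5, size=1_000)

print(x.describe())     # location and spread: count, mean, std, min, quartiles, max
print(x.skew())         # shape: skewness
print(x.kurtosis())     # shape: kurtosis
print(x.corr(y))        # linear (Pearson) correlation between X and Y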
Population vs Sample
Mean
The mean of a population of \(N\) elements is defined by:
\[
\mu = \frac{1}{N}\sum_{i=1}^{N} x_{i}
\]
For a sample of a population, a good estimator of the population standard deviation \(\sigma\) is the sample standard deviation:
\[
s = \sqrt {{\frac {1}{n-1}}\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}
\]
🤔 Using \(\frac{1}{n}\) would give an underestimate of the true population variance; dividing by \(n-1\) instead is known as Bessel's correction.
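A small numpy sketch of the difference (ddof=1 applies Bessel's correction; the data here is arbitrary):

import numpy as np

x = np.random.normal(loc=0, scale=1, size=30)   # hypothetical small sample

print(np.mean(x))           # sample mean
print(np.std(x, ddof=0))    # divides by n: tends to underestimate the population sd
print(np.std(x, ddof=1))    # divides by n - 1: the sample standard deviation s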
Interquartile range (IQR)
It’s the difference between upper and lower quartiles. \(\text{IQR} = Q_3 - Q_1\)
💡 It can be used to identify outliers in the data set: these are defined as observations that fall below \(Q_1 - 1.5\,\text{IQR}\) or above \(Q_3 + 1.5\,\text{IQR}\).
📈 The IQR is very useful for boxplots!
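A minimal sketch of this outlier rule, on hypothetical data with two artificial outliers appended:

import numpy as np

x = np.append(np.random.normal(loc=160, scale=25, size=500), [400, -50])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(f"IQR = {iqr:.1f}, bounds = [{lower:.1f}, {upper:.1f}], outliers: {outliers}")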
\(\text{Corr}(X,Y) = 0 \nRightarrow (X,Y)\) independent (as the Pearson coefficient \(r\) only captures linear dependence)
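A quick illustration of this caveat: below, Y is a deterministic (hence dependent) function of X, yet the Pearson correlation is close to zero because the relationship is not linear.

import numpy as np

x = np.linspace(-1, 1, 1_000)
y = x ** 2                      # fully determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.4f}")   # close to 0 despite the perfect dependence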
Probability Distributions
We assign a probability measure \(P(A)\) to an event \(A\). This is a value between 0 and 1 that shows how likely the event is.
A probability distribution is a mathematical function that describes the probability of different possible values of a variable.
Examples
Take the returns of a financial time series:
# Generate series from start of 2016 to end of 2020
fig = sns.kdeplot(df.returns, fill=True, color="r")
plt.show()
Well-Known distributions
Normal Distribution
Exponential Distribution
Uniform Distribution
Binomial Distribution
Normal Distribution
Most important distribution due to the central limit theorem: every variable that can be modelled as a sum of many small independent, identically distributed variables with finite mean and variance is approximately normal.
It is entirely described by just two parameters: \(\mu, \sigma\)
\[{\mathcal N(\mu, \sigma)} \]
import math

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def plot_normal_distribution(mu, variance):
    sigma = math.sqrt(variance)
    x = np.linspace(-10, 10, 100)
    plt.plot(x, stats.norm.pdf(x, mu, sigma), label=f"μ={mu}, σ²={variance}")


plot_normal_distribution(0, 1)
plot_normal_distribution(1, 2)
plot_normal_distribution(-3, 5)
plt.legend()
plt.show()
PDF & CDF
PDF
A probability density function (PDF) describes the relative likelihood that a continuous random variable takes on a given value; the probability of falling in an interval is the area under the PDF over that interval.
CDF
A cumulative distribution function (CDF) tells us the probability that a random variable takes on a value less than or equal to x.
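As a small sanity-check sketch (using a standard normal from scipy), the CDF at x is the integral of the PDF up to x:

import numpy as np
from scipy import stats
from scipy.integrate import quad

x = 1.0

# Integrating the PDF from -inf up to x reproduces the CDF at x
area, _ = quad(stats.norm.pdf, -np.inf, x)
print(area)                # ≈ 0.84
print(stats.norm.cdf(x))   # same value, directly from the CDF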
Generate a normal distribution
import scipy.stats as spn

spn.norm(loc=100, scale=12)  # where loc is the mean and scale is the std dev
<scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7ff5e0181a20>
Find P LESS than
# To find the probability that the variable has a value LESS than or equal
# to, let's say, 113, you'd use the CDF (Cumulative Distribution Function)
spn.norm.cdf(113, 100, 12)
0.8606697525503779
Find P GREATER than
# To find the probability that the variable has a value GREATER than or
# equal to, let's say, 125, you'd use the SF (Survival Function)
spn.norm.sf(125, 100, 12)
0.018610425189886332
Given P find the value
# To find the variate for which the probability is given, let's say the
# value needed to reach a 98% probability, you'd use the
# PPF (Percent Point Function)
spn.norm.ppf(.98, 100, 12)
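Since the PPF is the inverse of the CDF, feeding its result back through cdf recovers the probability (a quick check, reusing spn from above):

import scipy.stats as spn

value = spn.norm.ppf(.98, 100, 12)    # variate with 98% of the probability mass below it
print(value)
print(spn.norm.cdf(value, 100, 12))   # ≈ 0.98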
☝️ The goal is to assign a probability to events, defined as subsets of a sample space \(S\).
Conditional Probability
Conditional probability is the likelihood of one event occurring given that another, related event has already occurred.
The conditional probability of event A occurring given that event B has occurred is denoted as P(A|B) and is calculated as:
\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}\]
where:
\(P(A|B)\) is the conditional probability of event A given event B.
\(P(A ∩ B)\) is the probability that both event A and event B occur simultaneously (the intersection of A and B).
\(P(B)\) is the probability of event B occurring.
In simpler terms, it answers the question: “What is the probability that event A happens if we know that event B has already occurred?”
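For example, for a single roll of a fair die, the probability of a six given that the roll is even is:
\[
P(\text{six} \mid \text{even}) = \frac{P(\text{six} \cap \text{even})}{P(\text{even})} = \frac{1/6}{1/2} = \frac{1}{3}
\]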
Bayes’ Theorem
Bayes’ Theorem is a formula used in probability to update our beliefs about an event based on new evidence. It helps us find the probability of an event happening, given the probability of another related event.
\[P(A\mid B)={\frac {P(B\mid A) \cdot P(A)}{P(B)}}\]
Where:
\(P(A|B)\) is the probability of event A happening, given that event B has occurred.
\(P(B|A)\) is the probability of event B happening, given that event A has occurred.
\(P(A)\) is the probability of event A happening.
\(P(B)\) is the probability of event B happening.
Bayes’ Theorem allows us to update our belief in event A (posterior probability) based on the observed evidence from event B.
Example
Let’s consider a medical example to understand how Bayes’ Theorem works:
Suppose there is a rare disease that affects 1% of the population. A medical test has been developed to detect this disease, and the test is 95% accurate. This means that the test will correctly identify a person with the disease 95% of the time (true positive) and will correctly identify a healthy person 95% of the time (true negative).
However, there is also a chance of false positives and false negatives. Specifically, the test incorrectly identifies a healthy person as having the disease 5% of the time (false positive), and it incorrectly identifies a person with the disease as healthy 5% of the time (false negative).
Question: If a randomly selected person tests positive for the disease, what is the probability that they actually have the disease?
Given information:
P(Disease) = 0.01 (probability of having the disease)
P(Positive|Disease) = 0.95 (probability of testing positive given the person has the disease)
P(Positive|No Disease) = 0.05 (probability of testing positive given the person does not have the disease)
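Plugging these numbers into Bayes' Theorem (with the law of total probability in the denominator) gives the value that the code below also computes:
\[
P(\text{Disease} \mid \text{Positive}) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{0.0095}{0.059} \approx 0.161
\]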
def bayes_theorem(p_disease, p_positive_given_disease, p_positive_given_no_disease):
    p_no_disease = 1 - p_disease
    p_positive = (p_positive_given_disease * p_disease) + (p_positive_given_no_disease * p_no_disease)
    p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
    return p_disease_given_positive


# Given information
p_disease = 0.01
p_positive_given_disease = 0.95
p_positive_given_no_disease = 0.05

# Calculate the probability of having the disease given a positive test result
probability_disease_given_positive = bayes_theorem(p_disease, p_positive_given_disease, p_positive_given_no_disease)

# Convert probability to percentage
percentage_disease_given_positive = probability_disease_given_positive * 100

print(f"The probability of having the disease given a positive test result is approximately {percentage_disease_given_positive:.2f}%.")
The probability of having the disease given a positive test result is approximately 16.10%.